Author information

Name: Ostap Kalapun

Affiliation: student

E-mail:

Research overview

I’m going to reanalyze the study by Schlotawa L, Tyka K, Kettwig M, Ahrens-Nicklas RC et al. Drug screening identifies tazarotene and bexarotene as therapeutic agents in multiple sulfatase deficiency. EMBO Mol Med 2023 Mar 8;15(3):e14837. PMID: 36789546

The research explored treatments to Multiple sulfatase deficiency (MSD) disease, caused by the mutation in the SUMF1 gene, that results in an inability to breakdown sulfated substances. Out of all 785 tested drugs only 13 surpassed the baseline.

This, several experiments conducted transcriptome and pathway analysis. My focus is at the one of finding specific genes that are responsible for restoring sulfatase activity. The experiment included 7 different fibroblast lines that were treated with tazarotene (increases expression of retinoids genes and increases sulfatase activities), adapalene (increases expression of retinoids genes but doesn’t increase sulfatase activities), and DMSO (control group, base compound of all drugs).

The reads were sequenced using an RNA depletion technique.

Retrieving the data

Load the counts

# raw counts downloaded from GEO
# https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE205555

counts_csv <- "tables/counts.csv"
if (file.exists(counts_csv)){
    counts <- read.csv(counts_csv)
}else{ # this chunk extracts counts and then normalizes them with TPM method if they aren't already organized into .csv-files
    # read counts files for every sample (2 columns - gene ID and its expression count)
    files <- fs::dir_ls(path="../data/input_geo", glob="*exonCounts.txt")
    # create a tibble out of the raw files
    counts <- readr::read_tsv(files, id="path", col_names=c("ensembl_gene_id", "raw_counts"))
    df_split <- str_split_fixed(counts$path, "_", 13) %>% as.data.frame()
    counts$sample <- df_split$V2 
    counts <- counts %>% 
        mutate(sample=str_replace(sample, "geo/", "")) %>% 
        dplyr::relocate(sample) %>% dplyr::select(-path) %>% 
        pivot_wider(names_from="sample", values_from="raw_counts")
    
    # save raw organized counts to .csv-file
    write_csv(counts, counts_csv)
    counts <- counts %>% column_to_rownames("ensembl_gene_id")
}

Load the metadata

# the metadata table was parsed from the GEO accession webpage by taking the sample's ID and name preview
metadata <- read.csv("tables/metadata.csv", row.names = 1)

Gene annotation

ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl")

# get genes info via biomaRts
biomart_list <- getBM(filter="ensembl_gene_id", 
                      attributes=c("ensembl_gene_id", 
                                     "hgnc_symbol", 
                                     "transcript_biotype", 
                                     "description"),
                      values=counts$ensembl_gene_id, mart=ensembl)

# we subset only protein coding genes and miRNA
biomart_list_filtered <- biomart_list %>% dplyr::filter(transcript_biotype %in% c("protein_coding",
                                                                                  "miRNA"))
# select counts corresponding to retrieved biomaRt list
counts_filtered <- counts %>% dplyr::filter(ensembl_gene_id %in% biomart_list_filtered$ensembl_gene_id)

write.csv(counts_filtered, "tables/counts_filtered.csv", row.names=FALSE)
write.csv(biomart_list_filtered, "tables/gene_names_mapped.csv", row.names=FALSE)

Normalization

counts_filtered <- counts_filtered %>% tibble::column_to_rownames("ensembl_gene_id")
# here we perform differential expression analysis in order to calculate
# overall mean expression levels of genes;
# the design formula implies that we created an intercept-only model
dds <- DESeqDataSetFromMatrix(countData=counts_filtered, 
                              colData=metadata,
                              design=~1)
# normalization of the counts with the variance stabilizing transformation
counts_filtered_vst <- vst(dds, blind=TRUE) %>% assay() %>% as.data.frame()

Exploratory data analysis

PCA

For fibroblast_line factor

biplot(pca_calculated, colby="fibroblast_line", legendPosition="right", lab=NULL)

We see that samples from the line E are distant from the other lines, as well as the line A, which might be a signal that they have a different expression profile. Let alone that some samples are far away from their main cluster. There’s a slight chance that these samples are outliers. In general, all other lines are in the vicinity and the samples inside each fibroblast line are, essentially, clustered together with slight sparseness. That’s a good sign, which means that the samples were handled correctly during the experiment and have the same expression across the samples.

For treatment factor

biplot(pca_calculated, colby="treatment", legendPosition="right", lab=NULL)

The situation with treatment is much more interesting. Treatments are scattered around the plot but we can clearly see that they still clustered together, unlike the DMSO samples. This suggests that the drugs really contribute to changing expression profile in the cell so it’s worth analyzing further results.

Heatmap (Hierarchial clustering by samples)

matrix_correlations <- cor(counts_filtered_vst)

f1 <- pheatmap(matrix_correlations, 
               annotation=metadata, 
               show_rownames=FALSE,
               show_colnames=FALSE)

From the graph above it’s evident that samples from the line F are more similar to samples from the line G than to those from, say, A, C, or E. Also, the replicates are distributed equally acrosss each fibroblast line as it was due (that’s great, we don’t have any bias in that way).

Differential expression analysis

I apply the Wald significance test in this experiment because we aim to observe paired comparisons between groups, in this case, fibroblast lines and treatments, which is shown by the design formula.

dds <- DESeqDataSetFromMatrix(countData=counts_filtered, 
                              colData=metadata,
                              design=~fibroblast_line+treatment)

dds_tested <- DESeq(dds, test="Wald")
genes_mapped <- read.csv("tables/gene_names_mapped.csv")

# lfcShrink scales log2FoldChanges to attenuate the expression of genes with imprecise fold changes 
res_tazarotene_adapalene <- lfcShrink(dds_tested,
                                      type="ashr",
                                      contrast=c("treatment", "tazarotene", "adapalene")) %>% as.data.frame() %>% tibble::rownames_to_column("gene")
tmp <- genes_mapped %>% dplyr::filter(ensembl_gene_id %in% res_tazarotene_adapalene$gene) %>% dplyr::rename(gene=ensembl_gene_id)

res_tazarotene_adapalene_significant <- left_join(res_tazarotene_adapalene, tmp, by="gene") %>% dplyr::select(-gene) %>% dplyr::rename(gene=hgnc_symbol)

results <- res_tazarotene_adapalene_significant %>% distinct(gene, .keep_all = TRUE)

results_top <- results[order(results$padj),][1:1000,]
results_top$diff_expressed <- ifelse(results_top$log2FoldChange > 1, "UP", ifelse(results_top$log2FoldChange < -1, "DOWN", "NO"))
results_top$diff_expressed <- as.factor(results_top$diff_expressed)

volcano <- ggplot(data=results_top, 
                  aes(x=log2FoldChange, 
                      y=-log10(padj), 
                      col=diff_expressed, 
                      label=gene)) + 
          geom_point() +
          theme_minimal() +
          scale_color_manual(values=c("brown2", "grey", "chartreuse3")) +
          labs(title="Tazarotene vs Adapalene") +
          geom_text_repel() +
          geom_vline(xintercept=c(-1, 1), col="black", linetype="dotted") +
          geom_hline(yintercept=-log10(0.01), col="black", linetype="dotted")

ggplotly(volcano)
EnhancedVolcano(res_tazarotene_adapalene_significant,
                lab = res_tazarotene_adapalene_significant$gene,
                x = "log2FoldChange",
                y = "padj",
                title = "Tazaroten vs Adapalene",
                pCutoff = 0.01,
                FCcutoff = 1,
                ylim = c(-2, 15))

I had to lower log2FoldChange conventional thresholds to -1 and 1, which is still appropriate. The reason for this might be overall log2FoldChange shrinkage step. Thus, we found that there’re 5 significantly expressed genes, out of which 3 are downregulated (KRT17, CLDN1, FLG) and 2 are upregulated (H2BC14, AKR1B10).

Functional analysis

# loaded the Hallmark database
hallmark <- gmtPathways("tables/h.all.v2023.1.Hs.symbols.gmt")
tazarotene_adapalene <- results %>% dplyr::filter(!is.na(padj)) %>% dplyr::filter(!is.na(log2FoldChange))

# I ranked genes with the log2FoldChange*abs(-log10(padj)) ranking system
ranked <- tazarotene_adapalene %>% mutate(rank=(log2FoldChange * abs(-log10(padj)))) %>% 
  dplyr::select(gene, rank) %>% tibble::deframe()

# then I searched for pathways in the Hallmark database that correspond to the genes with the highest normalized enrichment score
fgsea_res <- fgsea(pathways=hallmark, stats=ranked)
fgsea_res %>% dplyr::filter(padj<0.25) %>% arrange(desc(NES))
##                                pathway         pval         padj   log2err
## 1:  HALLMARK_INTERFERON_ALPHA_RESPONSE 1.012456e-05 0.0005062281 0.5933255
## 2:       HALLMARK_BILE_ACID_METABOLISM 2.101237e-02 0.2101236638 0.3524879
## 3:                HALLMARK_P53_PATHWAY 1.240600e-02 0.1550749424 0.3807304
## 4: HALLMARK_WNT_BETA_CATENIN_SIGNALING 2.057351e-03 0.0401366500 0.4317077
## 5:                    HALLMARK_HYPOXIA 2.408199e-03 0.0401366500 0.4317077
##            ES       NES size                               leadingEdge
## 1:  0.9044216  1.630156   90       RTP4,GBP4,PSMB9,PSMB8,UBA7,IRF1,...
## 2:  0.8106220  1.434033   82 ABCA6,DHCR24,RBP1,CYP27A1,FADS2,ACSL5,...
## 3: -0.7643562 -1.437616  182 KRT17,CLCA2,CCND2,MDM2,S100A10,CDKN1A,...
## 4: -0.9518560 -1.498511   36                           DKK1,CCND2,JAG1
## 5: -0.8143936 -1.528096  174   PDK1,AK4,SERPINE1,DUSP1,CITED2,LDHA,...

As a result of performing fast gene set enrichment analysis, I obtained 8 potential pathway functions that are associated with the expressed genes. They are: - Interferon-alpha response (immune response marker) - Bile acids methabolism - Interferon gamma response (immune response marker) - Glycolysis (glucose break down) - Heme metabolism (oxygen-carrying marker) - P53 pathway (stress signals marker) - Hypoxia (can’t provide sufficient amount of oxygen to maintain homeostasis) - Wnt/beta-catenin signaling (maintains cellular homeostasis)

Conclusion

The upregulation of genes involved in the interferon-alpha and interferon-gamma responses indicates an immune response in MSD. This activation may reflect the cellular stress due to the accumulation of sulfatides. Low oxygen conditions, related to P53 pathway and Heme metabolism, can also be a response to metabolic stress. The downregulation of genes like KRT17 (keratin 17), CLDN1 (claudin 1), and FLG (filaggrin) can impact cell barrier function, which could be relevant. On the other hand, the upregulation of H2BC14 (a histone gene) and AKR1B10 (an enzyme involved in detoxification) might show us that the cells attempt to cope with increased stress and the need for tissue repair and protection.

sessionInfo()
## R version 4.3.1 (2023-06-16 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 11 x64 (build 22621)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=Ukrainian_Ukraine.utf8  LC_CTYPE=Ukrainian_Ukraine.utf8   
## [3] LC_MONETARY=Ukrainian_Ukraine.utf8 LC_NUMERIC=C                      
## [5] LC_TIME=Ukrainian_Ukraine.utf8    
## 
## time zone: Europe/Kiev
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] writexl_1.4.2               ggplotify_0.1.2            
##  [3] PCAtools_2.12.0             knitr_1.45                 
##  [5] tximport_1.28.0             DEGreport_1.36.0           
##  [7] pheatmap_1.0.12             DESeq2_1.40.2              
##  [9] SummarizedExperiment_1.30.2 MatrixGenerics_1.12.3      
## [11] matrixStats_1.1.0           plotly_4.10.3              
## [13] fgsea_1.26.0                EnhancedVolcano_1.18.0     
## [15] ggrepel_0.9.4               ashr_2.2-63                
## [17] biomaRt_2.56.1              RColorBrewer_1.1-3         
## [19] ensembldb_2.24.1            AnnotationFilter_1.24.0    
## [21] GenomicFeatures_1.52.2      AnnotationDbi_1.62.2       
## [23] Biobase_2.60.0              GenomicRanges_1.52.0       
## [25] GenomeInfoDb_1.36.4         IRanges_2.34.1             
## [27] S4Vectors_0.38.2            AnnotationHub_3.8.0        
## [29] BiocFileCache_2.8.0         dbplyr_2.3.4               
## [31] BiocGenerics_0.46.0         lubridate_1.9.3            
## [33] forcats_1.0.0               stringr_1.5.1              
## [35] dplyr_1.1.4                 purrr_1.0.2                
## [37] readr_2.1.4                 tidyr_1.3.0                
## [39] tibble_3.2.1                ggplot2_3.4.4              
## [41] tidyverse_2.0.0             htmltools_0.5.7            
## 
## loaded via a namespace (and not attached):
##   [1] later_1.3.1                   BiocIO_1.10.0                
##   [3] bitops_1.0-7                  filelock_1.0.2               
##   [5] XML_3.99-0.16                 lifecycle_1.0.4              
##   [7] mixsqp_0.3-48                 edgeR_3.42.4                 
##   [9] doParallel_1.0.17             lattice_0.22-5               
##  [11] MASS_7.3-60                   crosstalk_1.2.1              
##  [13] backports_1.4.1               magrittr_2.0.3               
##  [15] limma_3.56.2                  sass_0.4.7                   
##  [17] rmarkdown_2.25                jquerylib_0.1.4              
##  [19] yaml_2.3.7                    httpuv_1.6.12                
##  [21] cowplot_1.1.1                 DBI_1.1.3                    
##  [23] ConsensusClusterPlus_1.64.0   abind_1.4-5                  
##  [25] zlibbioc_1.46.0               RCurl_1.98-1.13              
##  [27] yulab.utils_0.1.0             rappdirs_0.3.3               
##  [29] circlize_0.4.15               GenomeInfoDbData_1.2.10      
##  [31] irlba_2.3.5.1                 dqrng_0.3.2                  
##  [33] DelayedMatrixStats_1.22.6     codetools_0.2-19             
##  [35] DelayedArray_0.26.7           xml2_1.3.5                   
##  [37] tidyselect_1.2.0              shape_1.4.6                  
##  [39] farver_2.1.1                  ScaledMatrix_1.8.1           
##  [41] GenomicAlignments_1.36.0      jsonlite_1.8.7               
##  [43] GetoptLong_1.0.5              ellipsis_0.3.2               
##  [45] iterators_1.0.14              foreach_1.5.2                
##  [47] tools_4.3.1                   progress_1.2.2               
##  [49] snow_0.4-4                    Rcpp_1.0.11                  
##  [51] glue_1.6.2                    mnormt_2.1.1                 
##  [53] xfun_0.41                     withr_2.5.2                  
##  [55] BiocManager_1.30.22           fastmap_1.1.1                
##  [57] fansi_1.0.5                   rsvd_1.0.5                   
##  [59] digest_0.6.33                 truncnorm_1.0-9              
##  [61] gridGraphics_0.5-1            timechange_0.2.0             
##  [63] R6_2.5.1                      mime_0.12                    
##  [65] colorspace_2.1-0              RSQLite_2.3.3                
##  [67] utf8_1.2.4                    generics_0.1.3               
##  [69] data.table_1.14.8             rtracklayer_1.60.1           
##  [71] prettyunits_1.2.0             httr_1.4.7                   
##  [73] htmlwidgets_1.6.3             S4Arrays_1.0.6               
##  [75] pkgconfig_2.0.3               gtable_0.3.4                 
##  [77] blob_1.2.4                    ComplexHeatmap_2.16.0        
##  [79] XVector_0.40.0                ProtGenerics_1.32.0          
##  [81] clue_0.3-65                   scales_1.3.0                 
##  [83] logging_0.10-108              png_0.1-8                    
##  [85] ggdendro_0.1.23               rstudioapi_0.15.0            
##  [87] reshape2_1.4.4                tzdb_0.4.0                   
##  [89] rjson_0.2.21                  nlme_3.1-164                 
##  [91] curl_5.1.0                    cachem_1.0.8                 
##  [93] GlobalOptions_0.1.2           BiocVersion_3.17.1           
##  [95] parallel_4.3.1                restfulr_0.0.15              
##  [97] pillar_1.9.0                  grid_4.3.1                   
##  [99] reshape_0.8.9                 vctrs_0.6.4                  
## [101] promises_1.2.1                BiocSingular_1.16.0          
## [103] beachmat_2.16.0               xtable_1.8-4                 
## [105] cluster_2.1.5                 evaluate_0.23                
## [107] invgamma_1.1                  cli_3.6.1                    
## [109] locfit_1.5-9.8                compiler_4.3.1               
## [111] Rsamtools_2.16.0              rlang_1.1.2                  
## [113] crayon_1.5.2                  SQUAREM_2021.1               
## [115] labeling_0.4.3                fs_1.6.3                     
## [117] plyr_1.8.9                    stringi_1.8.2                
## [119] psych_2.3.9                   viridisLite_0.4.2            
## [121] BiocParallel_1.34.2           munsell_0.5.0                
## [123] Biostrings_2.68.1             lazyeval_0.2.2               
## [125] Matrix_1.6-4                  hms_1.1.3                    
## [127] sparseMatrixStats_1.12.2      bit64_4.0.5                  
## [129] KEGGREST_1.40.1               shiny_1.8.0                  
## [131] highr_0.10                    interactiveDisplayBase_1.38.0
## [133] broom_1.0.5                   memoise_2.0.1                
## [135] bslib_0.6.1                   fastmatch_1.1-4              
## [137] bit_4.0.5